Abstract

As the cloud computing paradigm has gained prominence, the need for verifiable computation has grown increasingly
urgent. The concept of verifiable computation enables a weak client to outsource difficult computations
to a powerful, but untrusted, server. Protocols for verifiable computation aim to provide the client with a guarantee 
that the server performed the requested computations correctly, without requiring the client to perform the requested 
computations herself. By design, these protocols impose a minimal computational burden on the client. However, 
existing protocols require the server to perform a very large amount of extra bookkeeping, on top of the requested 
computations, in order to enable a client to easily verify the results. Verifiable computation has thus remained a 
theoretical curiosity, and protocols for it have not been implemented in real cloud computing systems. 

In this paper, our goal is to leverage GPUs to reduce the server-side slowdown for verifiable computation. To this
end, we identify abundant data parallelism in a state-of-the-art general-purpose protocol for verifiable computation,
originally due to Goldwasser, Kalai, and Rothblum [10], and recently extended by Cormode, Mitzenmacher, and
Thaler [8]. We implement this protocol on the GPU, and we obtain 40-120x server-side speedups relative to a
state-of-the-art sequential implementation. For benchmark problems, our implementation thereby reduces the slowdown
of the server to within factors of 100-500x relative to the original computations requested by the client. Furthermore,
we reduce the already small runtime of the client by 100x. Similarly, we obtain 20-50x server-side and client-side
speedups for related protocols targeted at specific streaming problems. We believe our results demonstrate the
immediate practicality of using GPUs for verifiable computation, and more generally, that protocols for verifiable
computation have become sufficiently mature to deploy in real cloud computing systems.

1 Introduction 

A potential problem in outsourcing work to commercial cloud computing services is trust. If we store a large dataset 
with a server, and ask the server to perform a computation on that dataset - for example, to compute the eigenvalues 
of a large graph, or to compute a linear program on a large matrix derived from a database - how can we know the 
computation was performed correctly? Obviously we don't want to compute the result ourselves, and we might not 
even be able to store all the data locally. Despite these constraints, we would like the server to not only provide us 
with the answer, but to convince us the answer is correct. 

Protocols for verifiable computation offer a possible solution to this problem. The ultimate goal of any such protocol
is to enable the client to obtain results with a guarantee of correctness from the server much more efficiently than
performing the computations herself. Another important goal of any such protocol is to enable the server to provide
results with guarantees of correctness almost as efficiently as providing results without guarantees of correctness.

Interactive proofs are a powerful family of protocols for establishing guarantees of correctness between a client 
and server. Although they have been studied in the theory community for decades, there had been no significant efforts 
to implement or deploy such proof systems until very recently. A recent line of work (e.g., [5, 6, 7, 8, 9, 10, 19]) has
made substantial progress in advancing the practicality of these techniques. In particular, prior work of Cormode,
Mitzenmacher, and Thaler [8] demonstrates that: (1) a powerful general-purpose methodology due to Goldwasser,
Kalai and Rothblum [10] approaches practicality; and (2) special-purpose protocols for a large class of streaming
problems are already practical.

In this paper, we clearly articulate this line of work to researchers outside the theory community. We also take
things one step further, leveraging the parallelism offered by GPUs to obtain significant speedups relative to
state-of-the-art implementations of [8]. Our goal is to invest the parallelism of the GPU to obtain correctness guarantees
with minimal slowdown, rather than to obtain raw speedups, as is the case with more traditional GPU applications.
We believe the insights of our GPU implementation could also apply to a multi-core CPU implementation. However,
GPUs are increasingly widespread, cost-effective, and power-efficient, and they offer the potential for speedups in
excess of those possible with commodity multi-core CPUs [17, 14].

We obtain server-side speedups ranging from 40-120x for the general-purpose protocol due to Goldwasser et
al. [10], and 20-50x speedups for related protocols targeted at specific streaming problems. Our general-purpose
implementation reduces the server-side cost of providing results with a guarantee of correctness to within factors
of 100-500x relative to a sequential algorithm without guarantees of correctness. Similarly, our implementation of
the special-purpose protocols reduces the server-side slowdown to within 10-100x relative to a sequential algorithm
without guarantees of correctness.

We believe the additional costs of obtaining correctness guarantees demonstrated in this paper would already be 
considered modest in many correctness-critical applications. For example, at one end of the application spectrum is 
Assured Cloud Computing for military contexts: a military user may need integrity guarantees when computing in 
the presence of cyber attacks, or may need such guarantees when coordinating critical computations across a mixture 
of secure military networks and insecure networks owned by civilians or other nations [1]. At the other end of the
spectrum, a hospital that outsources the processing of patients' electronic medical records to the cloud may require 
guarantees that the server is not dropping or corrupting any of the records. Even if every computation is not explicitly 
checked, the mere ability to check the computation could mitigate trust issues and stimulate users to adopt cloud 
computing solutions. 

Our source code is available at [20].

2 Background 

2.1 What are interactive proofs? 

Interactive proofs (IPs) were introduced within the computer science theory community more than a quarter century
ago, in seminal papers by Babai [11] and Goldwasser, Micali and Rackoff [3]. In any IP, there are two parties: a
prover P, and a verifier V. P is typically considered to be computationally powerful, while V is considered to be
computationally weak.

In an IP, P solves a problem using her (possibly vast) computational resources, and tells V the answer. P and
V then have a conversation, which is to say, they engage in a randomized protocol involving the exchange of one or
more messages between the two parties. The term interactive proofs derives from the back-and-forth nature of this
conversation. During this conversation, P's goal is to convince V that her answer is correct.

IPs naturally model the problem of a client (whom we model as V) outsourcing computation to an untrusted server
(whom we model as P). That is, IPs provide a way for a client to hire a cloud computing service to store and process
data, and to efficiently check the integrity of the results returned by the server. This is useful whenever the server
is not a trusted entity, either because the server is deliberately deceptive, or is simply buggy or inept. We therefore
interchange the terms server and prover where appropriate. Similarly, we interchange the terms client and verifier
where appropriate.

Any IP must satisfy two properties. Roughly speaking, the first is that if P answers correctly and follows the
prescribed protocol, then P will convince V to accept the provided answer. The second property is a security guarantee,
which says that if P is lying, then V must catch P in the lie and reject the provided answer with high probability. A
trivial way to satisfy this property is to have V compute the answer to the problem herself, and accept only if her answer
matches P's. But this defeats the purpose of having a prover. The goal of an interactive proof system is to allow V to
check P's answer using resources considerably smaller than those required to solve the problem from scratch.

Figure 1: High-level depiction of an execution of the GKR protocol.

At first blush, this may appear difficult or even impossible to achieve. However, IPs have turned out to be surprisingly
powerful. We direct the interested reader to ^ Chapter 8] for an excellent overview of this area.



2.2 How do interactive proofs work? 

At the highest level, many interactive proof methods (including the ones in this paper) work as follows. Suppose the
goal is to compute a function f of the input x.

First, the verifier makes a single streaming pass over the input x, during which she extracts a short secret s. This 
secret is actually a single (randomly chosen) symbol of an error-corrected encoding Enc(x) of the input. To be clear,
the secret does not depend on the problem being solved; in fact, for many interactive proofs, it is not necessary that 
the problem be determined until after the secret is extracted. 

Next, P and V engage in an extended conversation, during which V sends P various challenges, and P responds
to the challenges (see Figure 1 for an illustration). The challenges are all related to each other, and the verifier checks
that the prover's responses to all challenges are internally consistent.

The challenges are chosen so that the prover's response to the first challenge must include a (claimed) value for the
function of interest. Similarly, the prover's response to the last challenge must include a claim about what the value of
the verifier's secret s should be. If all of P's responses are internally consistent, and the claimed value of s matches
the true value of s, then the verifier is convinced that the prover followed the prescribed protocol and accepts. Otherwise,
the verifier knows that the prover deviated at some point, and rejects. From this point of view, the purpose of all
intermediate challenges is to guide the prover from a claim about f(x) to a claim about the secret s, while maintaining
V's control over P.

Intuitively, what gives the verifier surprising power to detect deviations is the error-correcting properties of Enc(x).
Any good error-correcting code satisfies the property that if two strings x and x' differ in even one location, then
Enc(x) and Enc(x') differ in almost every location. In the same way, interactive proofs ensure that if P flips even a
single bit of a single message in the protocol, then P either has to make an inconsistent claim at some later point, or
else has to lie almost everywhere in her final claim about the value of the secret s. Thus, if the prover deviates from
the prescribed protocol even once, the verifier will detect this with high probability and reject.
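The distance property described above can be checked numerically on a toy code. The sketch below assumes a Reed-Solomon style encoding: the input is read as the coefficient list of a polynomial over a small prime field, and the codeword lists the polynomial's value at every field element. The prime 97, the sample inputs, and the function names are illustrative choices and are not taken from the protocols in this paper.

```python
P = 97  # small prime field size, chosen only for illustration

def encode(x, p=P):
    """Read x as polynomial coefficients and evaluate at every point of F_p."""
    codeword = []
    for point in range(p):
        acc = 0
        for coeff in reversed(x):  # Horner's rule
            acc = (acc * point + coeff) % p
        codeword.append(acc)
    return codeword

def disagreements(a, b):
    """Count the positions where two codewords differ."""
    return sum(1 for s, t in zip(a, b) if s != t)

x = [3, 1, 4, 1, 5, 9, 2, 6]
x_prime = list(x)
x_prime[2] = 5  # x_prime differs from x in exactly one location

diff = disagreements(encode(x), encode(x_prime))
print(diff, "of", P, "codeword symbols differ")
```

Two distinct polynomials of degree less than 8 can agree on at most 7 points, so at least 90 of the 97 codeword symbols must differ here. A verifier who remembers one randomly chosen symbol therefore notices the change with high probability.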
2.3 Previous work



Unfortunately, despite their power, IPs have had very little influence on real systems where integrity guarantees on
outsourced computation would be useful. There appears to have been a folklore belief that these methods are impractical
[19]. As previously mentioned, a recent line of work (e.g., [5, 6, 7, 8, 9, 10, 19]) has made substantial progress in
advancing the practicality of these techniques. In particular, Goldwasser et al. [10] described a powerful general-purpose
protocol (henceforth referred to as the GKR protocol) that achieves a polynomial-time prover and nearly linear-time
verifier for a large class of computations. Very recently, Cormode, Mitzenmacher, and Thaler [8] showed how to
significantly speed up the prover in the GKR protocol [10]. They also implemented this protocol, and demonstrated
experimentally that their implementation approaches practicality. Even with their optimizations, the bottleneck in the
implementation of [8] is the prover's runtime, with all other costs (such as verifier space and runtime) being extremely
low.

A related line of work has looked at protocols for specific streaming problems. Here, the goal is not just to save 
the verifier time (compared to doing the computation without a prover), but also to save the verifier space. This is 
motivated by cloud computing settings where the client does not even have space to store a local copy of the input, and 
thus uses the cloud to both store and process the data. The protocols developed in this line of work do not require the 
client to store the input, but rather allow the client to make a single streaming pass over the input (which can occur, for 
example, while the client is uploading data to the cloud). Throughout this paper, whenever we mention a streaming 
verifier, we mean the verifier makes a single pass over the input, and uses space significantly sublinear in the size of 
the data. 

The notion of a non-interactive streaming verifier was first put forth by Chakrabarti et al. [6] and studied further by
Cormode et al. [7]. These works allow the prover to send only a single message to the verifier (e.g., as an attachment
to an email, or posted on a website), with no communication in the reverse direction. Moreover, these works present
protocols achieving provably optimal tradeoffs between the size of the proof and the space used by the verifier for a
variety of problems, ranging from matrix-vector multiplication to graph problems like bipartite perfect matching.

Later, Cormode, Thaler, and Yi extended the streaming model of [6] to allow an interactive prover and verifier,
who actually have a conversation. They demonstrated that interaction allows for much more efficient protocols in
terms of client space, communication, and server running time than are possible in the one-message model of [6, 7]. It
was also observed in this work that the general-purpose GKR protocol works with just a streaming verifier. Finally, the
aforementioned work of Cormode, Mitzenmacher, and Thaler [8] also showed how to use sophisticated Fast Fourier
Transform (FFT) techniques to drastically speed up the prover's computation in the protocols of [6, 7].

Also relevant is work by Setty et al. [19], who implemented a protocol for verifiable computation due to Ishai et al.
[13]. To set the stage for our results using parallelization, in Section 6 we compare our approach with [19] and [8] in
detail. As a summary, the implementation of the GKR protocol described in both this paper and in [8] has several
advantages over [19]. For example, the GKR implementation saves space and time for the verifier even when outsourcing
a single computation, while [19] saves time for the verifier only when batching together several dozen computations at
once and amortizing the verifier's cost over the batch. Moreover, the GKR protocol is unconditionally secure against
computationally unbounded adversaries who deviate from the prescribed protocol, while the Ishai et al. protocol relies
on cryptographic assumptions to obtain security guarantees. We present experimental results demonstrating that
the prover in the sequential implementation of [8] based on the GKR protocol runs significantly faster than the prover
in the implementation of [19] based on the Ishai et al. protocol [13].

Based on this comparison, we use the sequential implementation of [8] as our baseline. We then present results
showing that our new GPU-based implementation runs 40-120x faster than the sequential implementation in [8].

3 Our interactive proof protocols 

In this section, we give an overview of the methods implemented in this paper. Due to their highly technical nature,
we seek only to convey a high-level description of the protocols relevant to this paper, and deliberately avoid rigorous
definitions or theorems. We direct the interested reader to prior work for further details [6, 7, 8, 10].
Figure 2: A small arithmetic circuit.



3.1 GKR protocol 

The prover and verifier first agree on a layered arithmetic circuit of fan-in two over a finite field F computing the 
function of interest. An arithmetic circuit is just like a boolean circuit, except that the inputs are elements of F rather 
than boolean values, and the gates perform addition and multiplication over the field F, rather than computing AND, 
OR, and NOT operations. See Figure 2 for an example circuit. In fact, any boolean circuit can be transformed into an
arithmetic circuit computing an equivalent function over a suitable finite field, although this approach may not yield 
the most succinct arithmetic circuit for the function. 
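To make the circuit model concrete, the following sketch evaluates a layered arithmetic circuit of fan-in two, layer by layer, with all arithmetic taken modulo a prime. The circuit shown (a sum of squares of four inputs), the tuple-based gate format, and the modulus are hypothetical choices for illustration; this is not the circuit of Figure 2.

```python
P = 2**61 - 1  # a prime modulus; the choice of field is an assumption

def evaluate_circuit(inputs, layers, p=P):
    """Return the gate values of every layer, starting with the input layer.

    Each layer is a list of gates (op, left, right), where left and right
    index gates in the layer immediately below.
    """
    values = [[v % p for v in inputs]]
    for gates in layers:
        below = values[-1]
        layer_values = []
        for op, left, right in gates:
            a, b = below[left], below[right]
            layer_values.append((a + b) % p if op == "add" else (a * b) % p)
        values.append(layer_values)
    return values

# Hypothetical circuit computing x0^2 + x1^2 + x2^2 + x3^2.
layers = [
    [("mul", 0, 0), ("mul", 1, 1), ("mul", 2, 2), ("mul", 3, 3)],
    [("add", 0, 1), ("add", 2, 3)],
    [("add", 0, 1)],
]
values = evaluate_circuit([1, 2, 3, 4], layers)
output = values[-1][0]
print(output)
```

Keeping the values of every layer, rather than only the output, mirrors what a prover needs: the gate values at each layer are used again later in the protocol.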

Suppose the output layer of the circuit is layer d, and the input layer is layer 0. The protocol of [10] proceeds in
iterations, with one iteration for each layer of the circuit. The first iteration follows the general outline described in
Section 2.2, with V guiding P from a claim about the output of the circuit to a claim about a secret s, via a sequence
of challenges and responses. The challenges sent by V to P are simply random coins, which are interpreted as random
points in the finite field F. The prescribed responses of P are polynomials, where each prescribed polynomial depends
on the preceding challenge. Such a polynomial can be specified either by listing its coefficients, or by listing its
evaluations at several points.



However, unlike in Section 2.2, the secret s is not a symbol in an error-corrected encoding of the input, but rather
a symbol in an error-corrected encoding of the gate values at layer d - 1. Unfortunately, V cannot compute this secret
s on her own. Doing so would require evaluating all previous layers of the circuit, and the whole point of outsourcing
is to avoid this. So V has P tell her what s should be. But now V has to make sure that P is not lying about s.

This is what the second iteration accomplishes, with V guiding P from a claim about s to a claim about a new
secret s', which is a symbol in an encoding of the gate values at layer d - 2. This continues until we get to the input
layer. At this point, the secret is actually a symbol in an error-corrected encoding of the input, and V can compute this
secret in advance from the input easily on her own. Figure 1 illustrates the entirety of the GKR protocol at a very high
level.

We take this opportunity to point out an important property of the protocol of [10], which was critical in allowing
our GPU-based implementation to scale to large inputs. Namely, any iteration of the protocol involves only two layers
of the circuit at a time. In the ith iteration, the verifier guides the prover from a claim about gate values at layer d - i
to a claim about gate values at layer d - i - 1. Gates at higher or lower layers do not affect the prescribed responses
within iteration i.



3.2 Special-purpose protocols 



As mentioned in Section 2.3, efficient problem-specific non-interactive verifiable protocols have been developed for a
variety of problems of central importance in streaming and database processing, ranging from linear programming to
graph problems like shortest s-t path. The central primitive in many of these protocols is itself a protocol originally
due to Chakrabarti et al. [6], for a problem known as the second frequency moment, or F2. In this problem, the input
is a sequence of m items from a universe U of size n, and the goal is to compute $F_2(x) = \sum_{i \in U} f_i^2$, where $f_i$ is the
number of times item i appears in the sequence. As explained in [8], speeding up this primitive immediately speeds
up protocols for all of the problems that use the F2 protocol as a subroutine.

The aforementioned F2 protocol of Chakrabarti et al. [6] achieves provably optimal tradeoffs between the length
of the proof and the space used by the verifier. Specifically, for any positive integer h, the protocol can achieve a proof
length of just h machine words, as long as the verifier uses $v = O(n/h)$ words of space. For example, we may set
both h and v to be roughly $\sqrt{n}$, which is substantially sublinear in the input size n.

Very roughly speaking, this protocol follows the same outline as in Section 2.2, except that in order to remove
the interaction from the protocol, the verifier needs to compute a more complicated secret. Specifically, the verifier's
secret s consists of v symbols in an error-corrected encoding of the input, rather than a single symbol. To compute the
prescribed proof, the prover has to evaluate 2n symbols in the error-corrected encoding of the input. The key insight of
[8] is that these 2n symbols need not be computed independently (which would require substantially superlinear time),
but instead can be computed in $O(n \log n)$ time using FFT techniques. More specifically, the protocol of [8] partitions
the universe into a v x h grid, and it performs a sophisticated FFT variant known as the Prime Factor Algorithm [4] on
each row of the grid. The final step of P's computation is to compute the sum of the squared entries for each column
of the (transformed) grid; these values form the actual content of P's prescribed message.
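The outline above can be illustrated with a small end-to-end sketch. It assumes that each row of the v x h grid is encoded by the polynomial of degree less than h that interpolates the row at the points 0, ..., h-1, and that the prover's message lists the column sums of squares at 2h-1 points. For brevity the prover here uses naive Lagrange interpolation in place of the Prime Factor Algorithm FFT, so it does not reach the O(n log n) running time. The modulus, the grid layout, and all function names are assumptions made for illustration, not the implementation of [6] or [8].

```python
import random

P = 2**61 - 1  # prime field modulus (an assumption for this sketch)

def lagrange_eval(values, r, p=P):
    """Evaluate at r the polynomial of degree < len(values) that takes
    the value values[i] at the point i."""
    k = len(values)
    total = 0
    for i in range(k):
        num, den = 1, 1
        for m in range(k):
            if m != i:
                num = num * (r - m) % p
                den = den * (i - m) % p
        total = (total + values[i] * num * pow(den, -1, p)) % p
    return total

def prover_message(grid, h, p=P):
    """For each of 2h-1 points, extend every row to that point and sum the
    squares down the column: g(t) = sum_j f_j(t)^2 for t = 0, ..., 2h-2."""
    proof = []
    for point in range(2 * h - 1):
        column = [lagrange_eval(row, point, p) for row in grid]
        proof.append(sum(c * c for c in column) % p)
    return proof

def verify(secret, r, proof, h, p=P):
    """secret[j] holds row j's encoding at r. Return F2, or None to reject."""
    expected = sum(s * s for s in secret) % p
    if lagrange_eval(proof, r, p) != expected:
        return None
    return sum(proof[:h]) % p  # g(0) + ... + g(h-1) is the true answer

freq = [2, 0, 1, 3, 0, 5, 1, 1, 4, 0, 2, 2]  # frequency vector, n = 12
h, v = 4, 3
grid = [freq[j * h:(j + 1) * h] for j in range(v)]

r = random.randrange(P)                          # verifier's random point
secret = [lagrange_eval(row, r) for row in grid]  # v symbols of the encoding
proof = prover_message(grid, h)
result = verify(secret, r, proof, h)
print(result)
```

In this sketch the verifier computes her secret directly from the grid; a streaming verifier would instead accumulate the same v values item by item as the input arrives, without storing the grid.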



4 Parallelizing our protocols 

In this section, we explain the insights necessary to parallelize the computation of both the prover and the verifier for 
the protocols we implemented. 

4.1 GKR protocol 

4.1.1 Parallelizing P's computation

In every one of P's responses in the GKR protocol, the prescribed message from P is defined via a large sum over
roughly $S^3$ terms, where S is the size of the circuit, and so computing this sum naively would take $\Omega(S^3)$ time.
Roughly speaking, Cormode et al. in [8] observe that each gate of the circuit contributes to only a single term of this
sum, and thus this sum can be computed via a single pass over the relevant gates. The contribution of each gate to the
sum can be computed in constant time, and each gate contributes to logarithmically many messages over the course
of the protocol. Using these observations carefully reduces P's runtime from $\Omega(S^3)$ to $O(S \log S)$, where again S is
the circuit size.

The same observation reveals that P's computation can be parallelized: each gate contributes independently to the
sum in P's prescribed response. Therefore, P can compute the contribution of many gates in parallel, save the results
in a temporary array, and use a parallel reduction to sum the results. We stress that all arithmetic is done within the
finite field F, rather than over the integers. Figure 3 illustrates this process.
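The map-then-reduce structure described above can be sketched in a few lines. The per-gate formula below is a placeholder, since the real contribution of a gate in the protocol is more involved; the modulus, the sample values, and the function names are likewise assumptions. The reduction is written as a sequence of passes in which every addition is independent of the others, which is the shape a GPU reduction takes.

```python
P = 2**61 - 1  # prime field modulus (an assumption for this sketch)

def gate_contribution(gate_index, left_value, right_value, p=P):
    """Placeholder per-gate term; it depends only on data local to the gate."""
    return (left_value * right_value + gate_index) % p

def tree_reduce(terms, p=P):
    """Sum terms mod p by repeatedly adding pairs, as a parallel reduction would."""
    buf = list(terms)
    while len(buf) > 1:
        if len(buf) % 2:
            buf.append(0)  # pad so the buffer splits into two equal halves
        half = len(buf) // 2
        # Each addition in this pass touches a distinct pair of entries.
        buf = [(buf[i] + buf[i + half]) % p for i in range(half)]
    return buf[0] if buf else 0

left = [5, 7, 11, 13, 17]
right = [2, 3, 5, 7, 11]
# Step 1 (one thread per gate on a GPU): fill the temporary array.
contributions = [gate_contribution(g, left[g], right[g]) for g in range(len(left))]
# Step 2: reduce the temporary array to a single field element.
message_term = tree_reduce(contributions)
print(message_term)
```

A reduction over n terms takes about log2(n) passes, so with enough threads the summation is no longer proportional to the number of gates.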

4.1.2 Parallelizing V's computation

The bulk of V's computation (by far) consists of computing her secret, which consists of a single symbol s in a
particular error-corrected encoding of the input x. As observed in prior work [9], each symbol of the input contributes
independently to s. Thus, V can compute the contribution of many input symbols in parallel, and sum the results via
a parallel reduction, just as in the parallel implementation of P's computation. This speedup is perhaps of secondary
importance, as V runs extremely quickly even in the sequential implementation of [8]. However, parallelizing V's
computation is still an appealing goal, especially as GPUs are becoming more common on personal computers and
mobile devices.
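As a rough illustration of why each input symbol contributes independently to s, the sketch below assumes the encoding is the multilinear extension of the input evaluated at a random point r. The text above says only that a particular error-corrected encoding is used, so the multilinear extension, the modulus, and the function names are assumptions.

```python
import random

P = 2**61 - 1  # prime field modulus (an assumption for this sketch)

def chi(index, r, p=P):
    """Weight of input position `index` at the point r = (r_0, ..., r_{k-1})."""
    weight = 1
    for bit_pos, r_k in enumerate(r):
        bit = (index >> bit_pos) & 1
        weight = weight * (r_k if bit else (1 - r_k)) % p
    return weight

def secret_from_stream(stream, r, p=P):
    """Single pass over (index, value) pairs; only s and r are stored."""
    s = 0
    for index, value in stream:
        # The term for this symbol does not depend on any other symbol.
        s = (s + value * chi(index, r, p)) % p
    return s

x = [7, 1, 8, 2, 8, 1, 8, 3]                   # input of length 2^3
r = [random.randrange(P) for _ in range(3)]    # verifier's random point
s = secret_from_stream(enumerate(x), r)
print(s)
```

Because s is a plain sum of independent terms, the terms can be computed in any order, or in parallel, and then combined with a reduction.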
Figure 3: Illustration of parallel computation of the server's message to the client in the GKR protocol.



4.2 Special-purpose protocols 

4.2.1 Parallelizing P's computation

Recall that the prover in the special-purpose protocols can compute the prescribed message by interpreting the input
as a v x h grid, where h is roughly the proof length and v is the amount of space used by the verifier. The prover
then performs a sophisticated FFT on each row of the grid independently. This can be parallelized by transforming
multiple rows of the grid in parallel. Indeed, Cormode et al. [8] achieved roughly a 7x speedup for this problem by
using all eight cores of a multicore processor. Here, we obtain a much larger 20-50x speedup using the GPU. (Note
that [8] did not develop a parallel implementation of the GKR protocol, only of the special-purpose protocols.)

4.2.2 Parallelizing V's computation 

Recall that in the special-purpose protocols, the verifier's secret s consists of v symbols in an error-corrected encoding
of the input, rather than a single symbol. Just as in Section 4.1.2, this computation can be parallelized by noting that
each input symbol contributes independently to each entry of the encoded input. This requires V to store a large buffer
of input symbols to work on in parallel. In some streaming contexts, V may not have the memory to accomplish this.
Still, there are many settings in which this is feasible. For example, V may have several hundred megabytes of memory
available, and seek to outsource processing of a stream that is many gigabytes or terabytes in length. Thus, parallel
computation combined with buffering can help a streaming verifier keep up with a live stream of data: V splits her
memory into two buffers, and at all times one buffer will be collecting arriving items. As long as V can process the
full buffer (aided by parallelism) before her other buffer overflows, V will be able to keep up with the live data stream.
Notice this discussion applies to the client in the GKR protocol as well, as the GKR protocol also enables a streaming
verifier.
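The two-buffer scheme can be sketched with a single background worker: one buffer collects arriving items while the previously filled buffer is processed. The per-buffer work here is a stand-in (a sum modulo a prime), and the buffer size, modulus, and function names are illustrative assumptions rather than details of our implementation.

```python
from concurrent.futures import ThreadPoolExecutor

P = 2**61 - 1  # prime field modulus (an assumption for this sketch)

def partial_sum(buffer, p=P):
    """Stand-in for the parallel processing of one full buffer."""
    return sum(buffer) % p

def consume(stream, buffer_size, process_buffer):
    """Fill one buffer while the other is processed in the background."""
    results = []
    pending = None   # processing job for the buffer that was filled last
    filling = []     # buffer currently collecting arriving items
    with ThreadPoolExecutor(max_workers=1) as pool:
        for item in stream:
            filling.append(item)
            if len(filling) == buffer_size:
                if pending is not None:
                    # The earlier buffer must be finished before it is reused.
                    results.append(pending.result())
                pending = pool.submit(process_buffer, filling)
                filling = []  # switch to the other buffer
        if pending is not None:
            results.append(pending.result())
        if filling:
            results.append(process_buffer(filling))  # leftover partial buffer
    return results

parts = consume(range(1, 1001), 64, partial_sum)
total = sum(parts) % P
print(total)
```

At most two buffers are alive at any time. If processing a full buffer takes longer than filling the next one, the call to pending.result() blocks, which corresponds to the overflow condition described above.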



5 Architectural considerations 
5.1 GKR protocol 

The primary issue with any GPU-based implementation of the prover in the GKR protocol is that the computation is
extremely memory-intensive: for a circuit of size S (which corresponds to S arithmetic operations in an unverifiable
algorithm), the prover in the GKR protocol has to store all S gates explicitly, because she needs the values of these gates
to compute her prescribed messages. We investigate three alternative strategies for managing the memory overhead
of the GKR protocol, which we refer to as the no-copying approach, the copy-once-per-layer approach, and the
copy-every-message approach.

5.1.1 The no-copying approach 

The simplest approach is to store the entire circuit explicitly on the GPU. We call this the no-copying approach.
However, this means that the entire circuit must fit in device memory, a requirement which is violated even for relatively
small circuits, consisting of roughly tens of millions of gates.

5.1.2 The copy-once-per-layer approach 

Another approach is to keep the circuit in host memory, and only copy information to the device when it is needed.
This is possible because, as mentioned in Section 3.1, at any point in the protocol the prover only operates on two
layers of the circuit at a time, so only two layers of the circuit need to reside in device memory. We refer to this as the
copy-once-per-layer approach. This is the approach we used in the experiments in Section 6.

Care must be taken with this approach to prevent host-to-device copying from becoming a bottleneck. Fortunately, 
in the protocol for each layer there are several dozen messages to be computed before the prover moves on to the next 
layer, and this ensures that the copying from host to device makes up a very small portion of the runtime. 

This method is sufficient to scale to very large circuits for all of the problems considered in the experimental 
section of [8], since no single layer of the circuits is significantly larger than the problem input itself. However, this
method remains problematic for circuits that have (one or several) layers which are particularly wide, as an explicit 
representation of all the gates within a single wide layer may still be too large to fit in device memory. 

5.1.3 The copy-every-message approach 

In the event that there are individual layers which are too large to reside in device memory, a third approach is to
copy part of a layer at a time from the host to the device, and compute the contribution of each gate in the part to
the prover's message before swapping the part back to host memory and bringing in the next part. We call this the
copy-every-message approach. This approach is viable, but it raises a significant issue, alluded to in its name. Namely,
this approach requires host-to-device copying for every message, rather than just once per layer of the circuit. That
is, in any iteration i of the protocol, P cannot compute her jth message until after the (j - 1)th challenge from V is
received. Thus, for each message j, the entirety of the ith layer must be loaded piece-by-piece into device memory,
swapping each piece back to host memory after the piece has been processed. In contrast, the copy-once-per-layer
approach allows P to copy an entire layer i to the device and leave the entire layer in device memory for the entirety
of iteration i (which will consist of several dozen messages). Thus, the slowdown inherent in the copy-every-message
approach is not just that P has to break each layer into parts, but that P has to do host-to-device and device-to-host
copying for each message, instead of copying an entire layer and computing several messages from that layer.
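The difference in copy traffic between the two approaches can be estimated with a back-of-the-envelope model. The layer sizes and the figure of 40 messages per layer below are hypothetical, and the model counts bytes only; it ignores transfer latency and any overlap of copying with computation. The 16 bytes per gate corresponds to storing two 8-byte in-neighbor values for each gate.

```python
def bytes_copied(layer_sizes, messages_per_layer, bytes_per_gate=16,
                 every_message=False):
    """Rough count of bytes moved between host and device over the protocol."""
    total = 0
    for gates in layer_sizes:
        layer_bytes = gates * bytes_per_gate
        if every_message:
            # Each message streams the layer in and swaps it back out.
            total += 2 * layer_bytes * messages_per_layer
        else:
            # The layer is copied to the device once and stays resident.
            total += layer_bytes
    return total

layer_sizes = [1_000_000, 500_000, 250_000]  # hypothetical gates per layer
messages = 40                                 # "several dozen", chosen arbitrarily

once = bytes_copied(layer_sizes, messages)
every = bytes_copied(layer_sizes, messages, every_message=True)
print(once, every, every // once)
```

Under these assumptions the copy-every-message approach moves 80 times as many bytes, which is why it is attractive only when a single layer does not fit in device memory.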

We leave implementing the copy-every-message approach in full for future work, but preliminary experiments
suggest that this approach is viable in practice, resulting in less than a 3x slowdown compared to the
copy-once-per-layer approach. Notice that even after paying this slowdown, our GPU-based implementation would still achieve a
10-40x speedup compared to the sequential implementation of [8].

5.1.4 Memory access 

Recall that for each message in the ith iteration of the GKR protocol, we assign a thread to each gate g at the ith layer 
of the circuit, as each gate contributes independently to the prescribed message of the prover. The contribution of gate
g depends only on the index of g, the indices of the two gates feeding into g, and the values of the two gates feeding 
into g. 

Given this data, the contribution of gate g to the prescribed message can be computed using roughly 10-20 additions
and multiplications within the finite field F (the precise number of arithmetic operations required varies over the course
of the iteration). As described in Section 6, we choose to work over a field which allows for extremely efficient
arithmetic; for example, multiplying two field elements requires three machine multiplications of 64-bit data types,
and a handful of additions and bit shifts.
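One family of fields with this kind of cheap arithmetic is the integers modulo a Mersenne prime, where reduction needs only shifts, masks, and additions. The sketch below assumes the prime 2^61 - 1; this excerpt does not state which field is used, so that choice and the function names are assumptions. Python's arbitrary-precision integers also hide the handful of 64-bit machine multiplications that a C or CUDA version would need to form the 122-bit product.

```python
import random

P = (1 << 61) - 1  # Mersenne prime 2^61 - 1 (assumed field size)

def reduce_mersenne(x):
    """Reduce 0 <= x < 2^122 modulo 2^61 - 1 without a division."""
    x = (x & P) + (x >> 61)  # 2^61 is congruent to 1, so fold the high bits down
    x = (x & P) + (x >> 61)  # a second fold absorbs the possible carry
    return x - P if x >= P else x

def field_mul(a, b):
    """Multiply two field elements a, b in [0, P)."""
    return reduce_mersenne(a * b)

a = random.randrange(P)
b = random.randrange(P)
print(field_mul(a, b) == (a * b) % P)
```

The appeal of this design is that the reduction step costs a few bit operations, so the cost of a field multiplication is dominated by forming the integer product.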

In all of the circuits we consider, the indices of g's in-neighbors can be determined with very little arithmetic
and no global memory accesses. For example, if the wiring pattern of the circuit forms a binary tree, then the first
in-neighbor of g has index 2 · index(g), and the second in-neighbor of g has index 2 · index(g) + 1. For each message,
the thread assigned to g can compute this information from scratch without incurring any memory accesses.

In contrast, obtaining the values of g's in-neighbors requires fetching 8 bytes per in-neighbor from global memory. 
Memory accesses are necessary because it is infeasible to compute the value of each gate's in-neighbors from scratch 
each message, and so we store these values explicitly. As these global memory accesses can be a bottleneck in the 
protocol, we strive to arrange the data in memory to ensure that adjacent threads access adjacent memory locations. To
this end, for each layer i we maintain two separate arrays, with the jth entry of the first (respectively, second) array
storing the value of the first (respectively, second) in-neighbor of the jth gate at layer i. During iteration i, the thread
assigned to the jth gate accesses location j of the first and second array to retrieve the value of its first and second
in-neighbors respectively. This ensures that adjacent threads access adjacent memory locations.

For all layers, the corresponding arrays are populated with in-neighbor values when we evaluate the circuit at the
start of the protocol (we store each layer i's arrays on the host until the ith iteration of the protocol, at which point
we transfer the array from host memory to device memory as described in Section 5.1.2). Notice this methodology
sometimes requires data duplication: if many gates at layer i share the same in-neighbor g_1, then g_1's value will
appear many times in layer i's arrays. We feel that slightly increased space usage is a reasonable price to pay to ensure
memory coalescing.
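
The two-array layout described above can be sketched on the host side as follows. This is a minimal illustration in Python rather than the CUDA code of the implementation; the `wiring` callback and the function names are hypothetical. Entry j of each array is exactly what the thread assigned to gate j would read, and a shared in-neighbor is simply stored more than once.

```python
def binary_tree_wiring(j):
    """In-neighbor indices of gate j when the wiring pattern is a binary tree."""
    return 2 * j, 2 * j + 1


def build_in_neighbor_arrays(values_below, wiring, num_gates):
    """Build the two per-layer arrays of in-neighbor values.

    first[j] and second[j] hold the values of the first and second
    in-neighbor of gate j, so thread j only ever touches position j.
    """
    first = [0] * num_gates
    second = [0] * num_gates
    for j in range(num_gates):
        left, right = wiring(j)
        first[j] = values_below[left]
        second[j] = values_below[right]
    return first, second


# Binary-tree wiring: no duplication is needed.
tree_first, tree_second = build_in_neighbor_arrays([5, 7, 11, 13], binary_tree_wiring, 2)

# Every gate shares in-neighbor 0: its value is duplicated in the first array.
shared_first, shared_second = build_in_neighbor_arrays([5, 7, 11], lambda j: (0, j), 3)
```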



5.2 Special-purpose protocols 
5.2.1 Memory access 

Recall that the prover in our special-purpose protocols views the input as a v x h grid, and performs a sophisticated 
FFT on each row of the grid independently. Although the independence of calculations in each row offers abundant 
opportunities for task-parallelism, extracting the data-parallelism required for high performance on GPUs requires 
care due to the irregular memory access pattern of the specific FFT algorithm used. 

We observe that although each FFT has a highly irregular memory access pattern, this memory access pattern
is data-independent. Thus, we can convert abundant task-parallelism into abundant data-parallelism by transposing
the data grid into column-major rather than row-major order. This simple transformation ensures perfect memory
coalescing despite the irregular memory access pattern of each FFT, and improves the performance of our
special-purpose prover by more than 10x.
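
The effect of the transposition can be seen from the addresses touched in one step. The sketch below assumes one thread per row and that, because the access pattern is data-independent, every thread reads the same position c within its own row at a given step. The function names and the small grid dimensions are illustrative only.

```python
def row_major_address(r, c, rows, cols):
    """Address of element (r, c) when each row is stored contiguously."""
    return r * cols + c


def col_major_address(r, c, rows, cols):
    """Address of element (r, c) after transposing to column-major order."""
    return c * rows + r


def addresses_touched(address_fn, c, rows, cols):
    """Addresses read in one step when thread r reads position c of row r."""
    return [address_fn(r, c, rows, cols) for r in range(rows)]


strided = addresses_touched(row_major_address, 3, rows=4, cols=8)
coalesced = addresses_touched(col_major_address, 3, rows=4, cols=8)
```

In row-major order adjacent threads are a full row apart in memory, while in column-major order they read consecutive addresses regardless of which position c the FFT visits.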



6 Evaluation 

6.1 Implementation details 

Except where noted, we performed our experiments on an Intel Xeon 3 GHz workstation with 16 GB of host memory. 
Our workstation also has an NVIDIA GeForce GTX 480 GPU with 1.5 GB of device memory. We implemented all 
our GPU code in CUDA and Thrust [12] with all compiler optimizations turned on.

Similar to the sequential implementations of [8], both our implementation of the GKR protocol and the
special-purpose F2 protocol due to [6] work over the finite field $\mathbb{F}_p$ with $p = 2^{61} - 1$. We chose this field
for a number of reasons. Firstly, the integers embed naturally within it. Secondly, the field is large enough that the
probability the verifier fails to detect a cheating prover is tiny (roughly proportional to the reciprocal of the field
size). Thirdly, arithmetic within the field can be performed efficiently with simple shifts and bit-wise operations [20].
We remark that no floating point operations were necessary in any of our implementations, because all arithmetic is
done over finite fields.

Finally, we stress that in all reported costs below, we do count the time taken to copy data between the host and
the device, and all reported speedups relative to sequential processing take this cost into account. We do not count the
time to allocate memory for scratch space because this can be done in advance.

6.2 Experimental methodology for the GKR protocol 

We ran our GPU-based implementation of the GKR protocol on four separate circuits, which together capture several
different aspects of computation, from data aggregation to search, to linear algebra. The first three circuits were
described and evaluated in [8] using the sequential implementation of the GKR protocol. The fourth problem was
described and evaluated in [19] based on the Ishai et al. protocol [13]. Below, [n] denotes the integers {0, 1, ..., n - 1}.

• F2: Given a stream of m elements from [n], compute $\sum_{i \in [n]} a_i^2$, where $a_i$ is the number of occurrences
of i in the stream.

• F0: Given a stream of m elements from [n], compute the number of distinct elements (i.e., the number of i with
$a_i \neq 0$, where again $a_i$ is the number of occurrences of i in the stream).

• PM: Given a stream representing text $T = (t_0, \ldots, t_{n-1}) \in [n]^n$ and pattern $P = (p_0, \ldots, p_{q-1}) \in [n]^q$,
the pattern P is said to occur at location i in T if, for every position j in P, $p_j = t_{i+j}$. The pattern-matching
problem is to determine the number of locations at which P occurs in T.

• MatMult: Given three matrices $A, B, C \in [n]^{m \times m}$, determine whether $AB = C$. (In practice, we do not expect
C to truly be part of the input data stream. Rather, prior work [9, 8] has shown that the GKR protocol works
even if A and B are specified from a stream, while C is given later by P.)
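
For concreteness, the four benchmark problems can be stated as plain, unverified reference functions. The sketch below is only meant to pin down the definitions above on small inputs; the function names are hypothetical and these are not the timed implementations.

```python
from collections import Counter


def f2(stream):
    """Second frequency moment: sum of squared occurrence counts."""
    counts = Counter(stream)
    return sum(a * a for a in counts.values())


def f0(stream):
    """Number of distinct elements, i.e. the number of i with a_i != 0."""
    return len(set(stream))


def pattern_matches(text, pattern):
    """Number of locations i such that pattern[j] == text[i + j] for all j."""
    q = len(pattern)
    return sum(1 for i in range(len(text) - q + 1) if text[i:i + q] == pattern)


def matmult_check(A, B, C):
    """Return True exactly when AB == C for square matrices given as lists of rows."""
    m = len(A)
    return all(
        sum(A[i][k] * B[k][j] for k in range(m)) == C[i][j]
        for i in range(m)
        for j in range(m)
    )
```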

The first two problems, F2 and F0, are classical data aggregation queries which have been studied for more than
a decade in the data streaming community. F0 is also a highly useful subroutine in more complicated computations,
as it effectively allows for equality testing of vectors or matrices (by subtracting two vectors and seeing if the result is
equal to the zero vector). We make use of this subroutine when designing our matrix-multiplication circuit below.

The third problem, PM, is a classic search problem, and is motivated, for example, by clients wishing to store
(and search) their email on the cloud. Cormode et al. [8] considered the PATTERN MATCHING WITH WILDCARDS
problem, where the pattern and text can contain wildcard symbols that match with any character, but for simplicity we
did not implement this additional functionality.

We chose the fourth problem, matrix multiplication, for several reasons. First was its practical importance. Second
was a desire to experiment on problems requiring super-linear time to solve (in contrast to F2 and F0): running on a
super-linear problem allowed us to demonstrate that our implementation as well as that of [8] saves the verifier time in
addition to space, and it also forced us to grapple with the memory-intensive nature of the GKR protocol (see Section
4). Third was its status as a benchmark enabling us to compare the implementations of [8] and [19]. Although there
are also efficient special-purpose protocols to verify matrix multiplication (see Freivalds' algorithm [16, Section 7.1],
as well as Chakrabarti et al. [6, Theorem 5.2]), it is still interesting to see how a general-purpose implementation
performs on this problem. Finally, matrix multiplication is an attractive primitive to have at one's disposal when
verifying more complicated computations using the GKR protocol.

6.2.1 Description of circuits 

We briefly review the circuits for our benchmark problems. 

The circuit for F2 is by far the simplest (see Figure 4 for an illustration). This circuit simply computes the square
of each input wire using a layer of multiplication gates, and then sums the results using a single sum-gate of very large
fan-in. We remark that the GKR protocol typically assumes that all gates have fan-in two, but [8] explains how the
protocol can be modified to handle a single sum-gate of very large fan-in at the output.

The circuit for F0 exploits Fermat's Little Theorem, which says that for prime p, $a^{p-1} \equiv 1 \bmod p$ if and only if
$a \neq 0$. Thus, this circuit computes the (p - 1)th power of each input wire (taking all non-zero inputs to 1, and leaving
all 0-inputs at 0), and sums the results via a single sum-gate of high fan-in.
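
The Fermat's Little Theorem trick can be checked directly on a small frequency vector. The sketch below assumes the field size p = 2^61 - 1 from Section 6.1 and uses Python's built-in modular exponentiation in place of the layers of multiplication gates; the function name is hypothetical.

```python
P = (1 << 61) - 1  # assumed field size: the Mersenne prime 2^61 - 1


def f0_via_fermat(frequencies, p=P):
    """Count non-zero entries by summing a^(p-1) mod p over all entries.

    Each non-zero entry (mod p) contributes 1 and each zero entry contributes 0,
    so the sum equals the number of distinct elements as long as it is below p.
    """
    return sum(pow(a % p, p - 1, p) for a in frequencies) % p
```

For example, `f0_via_fermat([3, 0, 0, 7, 1])` counts the three non-zero frequencies.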

Figure 4: The circuit for F2.



The circuit for PM is similar to that for F0: essentially, for each possible location of the pattern, it computes a
value that is 0 if the pattern is at the location, and non-zero otherwise. It then computes the (p - 1)th power of each
such value and sums the results (i.e., it uses the F0 circuit as a subroutine) to determine the number of locations where
the pattern does (not) appear in the input.

Our circuit for MatMult uses similar ideas. We could run a separate instance of the GKR protocol to verify each
of the $m^2$ entries in the output matrix AB and compare them to C, but this would be very expensive for both the client
and the server. Instead, we specify a suitable circuit with a single output gate, allowing us to run a single instance
of the protocol to verify the output. Our circuit computes the entries in AB via naive matrix multiplication, and
subtracts the corresponding entry of C from each. It then computes the number of non-zero values using the F0 circuit
as a subroutine. The final output of the circuit is zero if and only if C = AB.
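
The single-output construction can be mimicked in a few lines. This is a sketch under the assumption that the field is the one with p = 2^61 - 1 from Section 6.1 and that all entries are small enough for equality modulo p to coincide with equality over the integers; `matmult_circuit_output` is a hypothetical name, and the code computes the circuit's output value rather than running the protocol.

```python
P = (1 << 61) - 1  # assumed field size: the Mersenne prime 2^61 - 1


def matmult_circuit_output(A, B, C, p=P):
    """Number of entries (mod p) in which AB differs from C.

    Mirrors the circuit: naive matrix multiplication, subtraction of C,
    then the F0 subroutine (raise to the power p - 1 and sum).
    """
    m = len(A)
    mismatches = 0
    for i in range(m):
        for j in range(m):
            entry = sum(A[i][k] * B[k][j] for k in range(m)) % p
            diff = (entry - C[i][j]) % p
            mismatches += pow(diff, p - 1, p)  # 1 if diff != 0, else 0
    return mismatches % p


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
correct = matmult_circuit_output(A, B, [[19, 22], [43, 50]])
one_wrong = matmult_circuit_output(A, B, [[20, 22], [43, 50]])
```

The verifier only needs to check that the single output is zero, instead of checking each of the entries of AB separately.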



6.2.2 Scaling to large inputs 

As described in Section 5, the memory-intensive nature of the GKR protocol made it challenging to scale to large
inputs, especially given the limited amount of device memory. Indeed, with the no-copying approach (where we
simply keep the entire circuit in device memory), we were only able to scale to inputs of size roughly 150,000 for the
F0 problem, and to 32 x 32 matrices for the MatMult problem on a machine with 1 GB of device memory. Using
the copy-once-per-layer approach, we were able to scale to inputs with over 2 million entries for the F0 problem, and
128 x 128 matrices for the MatMult problem. By running on an NVIDIA Tesla C2070 GPU with 6 GB of device
memory, we were able to push to 256 x 256 matrices for the MatMult problem; the data from this experiment is
reported in Table 2.



6.2.3 Evaluation of previous implementations 

To our knowledge, the only existing implementation for verifiable computation that can be directly compared to
that of Cormode et al. [8] is that of Setty et al. [19]. We therefore performed a brief comparison of the sequential
implementation of [8] with that of [19]. This provides important context in which to evaluate our results: our 40-120x
speedups compared to the sequential implementation of [8] would be less interesting if the sequential implementation
of [8] was slower than alternative methods. Prior to this paper, these implementations had never been run on the same
problems, so we picked a benchmark problem (matrix multiplication) evaluated in [19] and compared to the results
reported there.

We stress that our goal is not to provide a rigorous quantitative comparison of the two implementations. Indeed, we
only compare the implementation of [8] to the numbers reported in [19]; we never ran the implementations on the same
system, leaving this more rigorous comparison for future work. Moreover, both implementations may be amenable to
further optimization. Despite these caveats, the comparison between the two implementations seems clear. The results
are summarized in Table 1.

Implementation   Matrix Size   P Time       V Time         Total Communication
[8]              512 x 512     3.11 hours   0.12 seconds   138.1 KB
[19], Pepper     400 x 400     8.1 years*   14 hours*      Not Reported
[19], Habanero   400 x 400     17 days†     2.1 minutes†   17.1 GB†

Table 1: Comparison of the costs for the sequential implementations of [8] and [19]. Entries marked with * indicate
that the costs given are total costs over 45,000 queries. Entries marked with † indicate that the costs are total costs
over 111 queries.



Problem   Input Size    Circuit Size    GPU P      Sequential   Circuit      GPU V      Sequential   Unverified
          (number of    (number of      Time (s)   P Time (s)   Evaluation   Time (s)   V Time (s)   Algorithm
          entries)      gates)                                  Time (s)                             Time (s)
F2        8.4 million   25.2 million    3.7        424.6        0.1          0.035      3.600        0.028
F0        2.1 million   255.8 million   128.5      8,268.0      4.2          0.009      0.826        0.005
PM        524,288       76.0 million    38.9       1,893.1      1.2          0.004      0.124        0.006
MatMult   65,536        42.3 million    39.6       1,658.0      0.9          0.003      0.045        0.080

Table 2: Prover runtimes in the GKR protocol for all four problems considered.



In Table 1, Pepper refers to an implementation in [19] which is actually proven secure against polynomial-time
adversaries under cryptographic assumptions, while Habanero is an implementation in [19] which runs faster by allowing
for a very high soundness probability (the probability that a deviating prover can fool the verifier), and utilizing what
the authors themselves refer to as heuristics (not proven secure in [19], though the authors indicate this may be due to
space constraints). In contrast, the soundness probability in the implementation of [8] is roughly proportional to
the reciprocal of the field size $p = 2^{61} - 1$, and the protocol is unconditionally secure even against computationally
unbounded adversaries.

The implementation of [19] has very high set-up costs for both P and V, and therefore the costs of a single query
are very high. But this set-up cost can be amortized over many queries, and the most detailed experimental results
provided in [19] give the costs for batches of hundreds or thousands of queries. The costs reported in the second and
third rows of Table 1 are therefore the total costs of the implementation when run on a large number of queries.

When we run the implementation of [8] on a single 512 x 512 matrix, the server takes 3.11 hours, the client takes
0.12 seconds, and the total length of all messages transmitted between the two parties is 138.1 KB. In contrast, the
server in the heuristic implementation of [19], Habanero, requires 17 days amortized over 111 queries when run on
considerably smaller matrices (400 x 400). This translates to roughly 3.7 hours per query, but the cost of a single
query without batching is likely about two orders of magnitude higher. The client in Habanero requires 2.1 minutes to
process the same 111 queries, or a little over 1 second per query, while the total communication is 17.1 GB, or about
157 MB per query. Again, the per-query costs will be roughly two orders of magnitude higher without the batching.
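
The per-query figures quoted above follow from dividing the batch totals by the batch size. The short calculation below reproduces them; it assumes 1 GB = 1024 MB, which is the convention that matches the roughly 157 MB figure, and the variable names are illustrative.

```python
QUERIES = 111  # batch size reported for Habanero

server_hours_per_query = 17 * 24 / QUERIES      # 17 days of server time in total
client_seconds_per_query = 2.1 * 60 / QUERIES   # 2.1 minutes of client time in total
comm_mb_per_query = 17.1 * 1024 / QUERIES       # 17.1 GB in total, assuming 1 GB = 1024 MB
```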

We conclude that, even under large batching, the per-query time for the server of the sequential implementation
of [8] is competitive with the heuristic implementation of [19], while the per-query time for the verifier is about two
orders of magnitude smaller, and the per-query communication cost is between two and three orders of magnitude
smaller. Without the batching, the per-query time of [8] is roughly 100x smaller for the server and 1,000x smaller
for the client, and the communication cost is about 100,000x smaller.

Likewise, the implementation of [8] is over 5 orders of magnitude faster for the client than the non-heuristic
implementation Pepper, and four orders of magnitude faster for the server.

6.2.4 Evaluation of our GPU-based implementation 

Figure 5 demonstrates the performance of our GPU-based implementation of the GKR protocol. Table 2 also gives a
succinct summary of our results, showing the costs for the largest instance of each problem we ran on. We consider
the main takeaways of our experiments to be the following.

Figure 5: Comparison of prover and verifier runtimes between the sequential implementation of the GKR protocol due
to [8] and our GPU-based implementation. Note that all plots are on a log-log scale. Plots (a), (b), (c), and (d) depict the
prover runtimes for F0, F2, PM, MatMult respectively. Plot (e) depicts the verifier runtimes for the GKR protocol.
We include only one plot for the verifier, since its dominant cost in the GKR protocol is problem-independent.



Server-side speedup obtained by GPU computing. Compared to the sequential implementation of [8], our GPU-based
server implementation ran close to 115x faster for the F2 circuit, about 60x faster for the F0 circuit, 45x faster
for PM, and about 40x faster for MatMult (see Figure 5).

Notice that for the first three problems, we need to look at large inputs to see the asymptotic behavior of the 
curve corresponding to the parallel prover's runtime. Due to the log-log scale in Figure 5, the curves for both the
sequential and parallel implementations are asymptotically linear, and the 45-120x speedup obtained by our GPU-based
implementation is manifested as an additive gap between the two curves. The explanation for this is simple:
there is considerable overhead relative to the total computation time in parallelizing the computation at small inputs, 
but this overhead is more effectively amortized as the input size grows. 

In contrast, notice that for MatMult the slope of the curve for the parallel prover remains significantly smaller 
than that of the sequential prover throughout the entire plot. This is because our GPU-based implementation ran out 
of device memory well before the overhead in parallelizing the prover's computation became negligible. We therefore 
believe the speedup for MatMult would be somewhat higher than the 40x speedup observed if we were able to run
on larger inputs. 

Could a parallel verifiable program be faster than a sequential unverifiable one? The very first step of the prover's 
computation in the GKR protocol is to evaluate the circuit. In theory this can be done efficiently in parallel, by proceed- 
ing sequentially layer by layer and evaluating all gates at a given layer in parallel. However, in practice we observed 
that the time it takes to copy the circuit to the device exceeds the time it takes to evaluate the circuit sequentially. This 
observation suggests that on the current generation of GPUs, no GPU-based implementation of the prover could run 
faster than a sequential unverifiable algorithm. This is because sequentially evaluating the circuit takes at least as long 
as the unverifiable sequential algorithm, and copying the data to the GPU takes longer than sequentially evaluating the
circuit. This observation applies not just to the GKR protocol, but to any protocol that uses a circuit representation of
the computation (which is a standard technique in the theory literature [13, 18]). Nonetheless, we can certainly hope
to obtain a GPU-based implementation that is competitive with sequential unverifiable algorithms.
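
The layer-by-layer evaluation mentioned above can be sketched as follows. The circuit encoding (a list of layers, each a list of `(op, left, right)` gates referring to positions in the previous layer) and the function name are assumptions made for this illustration, as is the field size p = 2^61 - 1 from Section 6.1. The inner loop is written sequentially here; on a GPU each gate of a layer would be handled by its own thread.

```python
P = (1 << 61) - 1  # assumed field size: the Mersenne prime 2^61 - 1


def evaluate_layered_circuit(inputs, layers, p=P):
    """Evaluate a layered arithmetic circuit, returning the values of every layer."""
    values = [x % p for x in inputs]
    all_values = [values]
    for layer in layers:
        # Every gate reads only the previous layer, so the gates of one
        # layer are independent and could be evaluated in parallel.
        next_values = []
        for op, left, right in layer:
            if op == "add":
                next_values.append((values[left] + values[right]) % p)
            elif op == "mul":
                next_values.append((values[left] * values[right]) % p)
            else:
                raise ValueError(f"unknown gate type: {op}")
        values = next_values
        all_values.append(values)
    return all_values


# A small F2-style circuit with fan-in two: square each input, then sum in a binary tree.
layers = [
    [("mul", i, i) for i in range(4)],
    [("add", 0, 1), ("add", 2, 3)],
    [("add", 0, 1)],
]
result = evaluate_layered_circuit([1, 2, 3, 4], layers)
```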

Server-side slowdown relative to unverifiable sequential algorithms. For F2, the total slowdown for the prover 
was roughly 130x (3.7 seconds compared to 0.028 seconds for the unverifiable algorithm, which simply iterates over 
all entries of the frequency vector and computes the sum of the squares of each entry). We stress that it is likely 
that we overestimate the slowdown resulting from our protocol, because we did not count the time it takes for the 
unverifiable implementation to compute the number of occurrences of each item i, that is, to aggregate the stream into 
its frequency vector representation $(a_1, \ldots, a_n)$. Instead, we simply generated the vector of frequencies at random
(we did not count the generation time), and calculated the time to compute the sum of their squares. In practice, this 
aggregation step may take much longer than the time required to compute the sum of the squared frequencies once the 
stream is in aggregated form. 

For F0, our GPU-based server implementation ran roughly 25,000x slower than the obvious unverifiable algorithm
which simply counts the number of non-zero items in a vector. The larger slowdown compared to the F2 problem is
unsurprising. Since F0 is a less arithmetic problem than F2, its circuit representation is much larger. Once again, it is
likely that we overestimate the slowdowns for this problem, as we did not count the time for an unverifiable algorithm
to aggregate the stream into its frequency-vector representation. Despite the substantial slowdown incurred for F0
compared to a naive unverifiable algorithm, it remains valuable as a primitive for use in heavier-duty computations
like PM and MatMult.

For PM, the bulk of the circuit consists of an F0 subroutine, and so the runtime of our GPU-based implementation
was similar to that for F0. However, the sequential unverifiable algorithm for PM takes longer than that for F0.
Thus, our GPU-based server implementation ran roughly 6,500x slower than the naive unverifiable algorithm, which
exhaustively searches all possible locations for occurrences of the pattern.

For MatMult, our GPU-based server implementation ran roughly 500x slower than naive matrix multiplication
for 256 x 256 matrices. Moreover, this number is likely inflated due to cache effects from which the naive unverifiable
algorithm benefited. That is, the naive unverifiable algorithm takes only 0.09 seconds for 256 x 256 matrices, but takes
7.1 seconds for 512 x 512 matrices, likely because the algorithm experiences very few cache misses on the smaller
matrix. We therefore expect the slowdown of our implementation to fall to under 100x if we were to scale to larger
matrices. Furthermore, the GKR protocol is capable of verifying matrix multiplication over the finite field $\mathbb{F}_p$ rather
than over the integers at no additional cost. Naive matrix multiplication over this field is 2-3x slower than
matrix multiplication over the integers (even using the fast arithmetic operations available for this field). Thus, if our
goal was to work over this finite field rather than the integers, our slowdown would fall by another 2-3x. It is therefore
possible that our server-side slowdown may be less than 50x at larger inputs compared to naive matrix multiplication
over $\mathbb{F}_p$.
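
The slowdown factors quoted in this discussion can be recomputed from the GPU prover and unverified-algorithm columns of Table 2. The snippet below is only a consistency check on those reported timings; the dictionary and function names are illustrative.

```python
# (GPU prover time, unverified algorithm time) in seconds, copied from Table 2.
TABLE2_TIMES = {
    "F2": (3.7, 0.028),
    "F0": (128.5, 0.005),
    "PM": (38.9, 0.006),
    "MatMult": (39.6, 0.080),
}


def prover_slowdown(problem):
    """Ratio of GPU prover time to the time of the unverified algorithm."""
    gpu_seconds, unverified_seconds = TABLE2_TIMES[problem]
    return gpu_seconds / unverified_seconds


slowdowns = {name: round(prover_slowdown(name)) for name in TABLE2_TIMES}
```

The resulting ratios agree with the rounded figures in the text: roughly 130x for F2, 25,000x for F0, 6,500x for PM, and 500x for MatMult.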

Client-side speedup obtained by GPU computing. The bulk of V's computation consists of evaluating a single
symbol in an error-corrected encoding of the input; this computation is independent of the circuit being verified. For
reasonably large inputs (see the row for F2 in Table 2), our GPU-based client implementation performed this
computation over 100x faster than the sequential implementation of [8]. For smaller inputs the speedup was unsurprisingly
smaller due to increased overhead relative to total computation time. Still, we obtained a 15x speedup even for an
input of length 65,536 (256 x 256 matrix multiplication).

Client-side speedup relative to unverifiable sequential algorithms. Our matrix-multiplication results clearly
demonstrate that for problems requiring super-linear time to solve, even the sequential implementation of [8] will save the
client time compared to doing the computation locally. Indeed, the runtime of the client is dominated by the cost of
evaluating a single symbol in an error-corrected encoding of the input, and this cost grows linearly with the input size.
Even for relatively small matrices of size 256 x 256, the client in the implementation of [8] saved time. For matrices
with tens of millions of entries, our results demonstrate that the client will still take just a few seconds, while
performing the matrix multiplication computation would require orders of magnitude more time. Our results demonstrate that
GPU computing can be used to reduce the verifier's computation time by another 100x.

Figure 6: Comparison of prover (a) and verifier (b) runtimes in the sequential and GPU-based implementations of the
special-purpose F2 protocol. Note that all plots are on a log-log scale. Throughout, the verifier's space usage and the
proof length are both set to $\sqrt{n}$.



V space   Proof length   GPU P      Sequential   GPU V      Sequential
(KB)      (KB)           Time (s)   P Time (s)   Time (s)   V Time (s)
39.1      78.1           2.901      43.773       0.019      0.858
78.2      39.1           1.872      43.544       0.010      0.639
156.5     19.5           1.154      37.254       0.010      0.577
313.2     9.8            0.909      36.554       0.008      0.552
1953.1    0.78           0.357      20.658       0.007      0.551

Table 3: Prover and verifier runtimes for the special-purpose F2 protocol. All results are for fixed universe size n = 25
million, varying the tradeoff between proof length and the client's space usage. This universe size corresponds to 190.7
MB of data.



6.3 Special-purpose protocols. 

We implemented both the client and the server of the non-interactive F2 protocol of [6, 8] on the GPU. As described in
Section 2.3, this protocol is the fundamental building block for a host of non-interactive protocols achieving optimal
tradeoffs between the space usage of the client and the length of the proof. Figure 6 demonstrates the performance
of our GPU-based implementation of this protocol. Our GPU implementation obtained a 20-50x server-side speedup
relative to the sequential implementation of [8]. This speedup was only possible after transposing the data grid into
column-major order so as to achieve perfect memory coalescing, as described in Section 5.2.1.

The server-side speedups we observed depended on the desired tradeoff between proof length and space usage. 
That is, the protocol partitions the universe [n] into a v x h grid where h is roughly the proof length and v is the 
verifier's space usage. The prover processes each row of the grid independently (many rows in parallel). When v is 
large, each row requires a substantial amount of processing. In this case, the overhead of parallelization is effectively 
amortized over the total computation time. If v is smaller, then the overhead is less effectively amortized and we see 
less impressive speedups. 

We note that Figure 6 depicts the prover runtime for both the sequential implementation of [8] and our GPU-based
implementation with the parameters $h = v = \sqrt{n}$. With these parameters, our GPU-based implementation
achieved roughly a 20x speedup relative to the sequential program. Table 3 shows the costs of the protocol for fixed
universe size n = 25 million as we vary the tradeoff between h and v. The data in this table shows that our parallel
implementation enjoys a 40-60x speedup relative to the sequential implementation when v is substantially larger than
h. This indicates that we would see similar speedups even when $h = v = \sqrt{n}$ if we scaled to larger input sizes
n. Notice that universe size n = 25 million corresponds to over 190 MB of data, while the verifier's space usage
and the proof length are hundreds or thousands of times smaller in all our experiments. An unverifiable sequential
algorithm for computing the second frequency moment over this universe required 0.031 seconds; thus, our parallel
server implementation achieved a slowdown of 10-100x relative to an unverifiable algorithm.
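
The speedup and slowdown ranges above can be recomputed from the prover columns of Table 3 and the 0.031-second unverified baseline. The snippet is only a consistency check on the reported timings, with illustrative names; rows are listed in the same order as in the table, from smallest to largest verifier space.

```python
# (GPU prover time, sequential prover time) in seconds, one pair per row of Table 3.
TABLE3_PROVER = [
    (2.901, 43.773),
    (1.872, 43.544),
    (1.154, 37.254),
    (0.909, 36.554),
    (0.357, 20.658),
]
UNVERIFIED_SECONDS = 0.031  # unverified sequential F2 computation over the same universe

speedups = [seq / gpu for gpu, seq in TABLE3_PROVER]
slowdowns = [gpu / UNVERIFIED_SECONDS for gpu, _ in TABLE3_PROVER]
```

The last two rows, where the verifier's space is much larger than the proof length, give the 40-60x speedups, and every row's slowdown relative to the unverified algorithm falls between 10x and 100x.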

In contrast, the verifier's computation was much easier to parallelize, as its memory access pattern is highly regular.
Our GPU-based implementation obtained 40-70x speedups relative to the sequential verifier of [8] across all input
lengths n, including when we set $h = v = \sqrt{n}$.

7 Conclusions 

This paper adds to a growing line of work focused on obtaining fully practical methods for verifiable computation. 
Our primary contribution in this paper was in demonstrating the power of parallelization, and GPU computing in 
particular, to obtain robust speedups in some of the most promising protocols in this area. We believe the additional 
costs of obtaining correctness guarantees demonstrated in this paper would already be considered modest in many 
correctness-critical applications. Moreover, it seems likely that future advances in interactive proof methodology will
also be amenable to parallelization. This is because the protocols we implement utilize a number of common primitives
(such as the sum-check protocol [15]) as subroutines, and these primitives are likely to appear in future protocols as
well. 

Several avenues for future work suggest themselves. First, the GKR protocol is rather inefficient for the prover 
when applied to computations which are non-arithmetic in nature, as the circuit representation of such a computa- 
tion is necessarily large. Developing improved protocols for such problems (even special-purpose ones) would be 
interesting. Prime candidates include many graph problems like minimum spanning tree and perfect matching. More 
generally, a top priority is to further reduce the slowdown or the memory-intensity for the prover in general-purpose 
protocols. Both these goals could be accomplished by developing an entirely new construction that avoids the circuit 
representation of the computation; it is also possible that the prover within the GKR construction can be further
optimized without fundamentally altering the protocol.